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A H.264 standard is one of the most popular coding standard with significant 
improvement in video broadcasting and streaming application. However it’s 
significant in compression but needs huge calculation and complex algorithm 
for providing better image quality and compression rate. In H.264 coding 
technique, designing of decoder is a key factor for efficient coding. In this 
paper we are designing a decoder using a complex input. We ensured several 
improvement like looping arrangement, buffer upgradation, buffer 
supplement, memory reusability and pipelining architecture. We have 
modified the memory structure also. Our designed decoder achieves a better 
frame decoding efficiency against state-of-art methods. The proposed 
approach also provides good area optimization with a maximum frequency of 
355 MHz. 
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1. INTRODUCTION 

H.264 is a most broadly used standard for video coding with significant improvement in video 
broadcasting, video streaming and optical disc. It is established by JVT (Joint video Team) of ITU-T and 
ISO/IEC and also known as MPEG-4 part 10 advance video coding. Most of the bits transmitted wirelessly in 
a communication networks uses MPEG-4 part 10 advanced video coding (AVC). H.264 video coding 
comprises high compression efficiency, so it is most frequently used in video coding. It has some new 
features including inter-prediction, intra prediction, variable block size and context-based adaptive entropy 
coding [1]. These all new feature needs complex compression algorithm and huge calculations to provide 
better image quality and compression rate. H.264 coding technique involves source code from different 
domain like computational physics, computer science and machine learning approach, which makes complex 
source code and challenges in synthesis. The complexity of the H.264 decoder is increased lot compare to 
MPEG-4 decoder. 

Usually a H.264 decoder contains a pipelining architecture of 4*4 sub block. The data processing 
time and the complexity of each stage of pipelining architecture depends on the type of data and decoding 
methods. The timing of each stages should be known, so that the stages which requires more processing time 
can be normalized by efficient decoder architecture. An efficient decoder architecture can idealize all the 
time consuming stages by reducing the complexity up to certain level.It is observed by run time analysis that 
the motion compensation (MC) uses 55% of the decoding time. So, it’s a crucial factor in designing a 
decoder architecture [2-3] considering performance. Reading pixel data with less complexity can increase the 
performance of the decoder. H.264 decoders usually have three units. One is the MC unit discussed above, 
second is the deblocking filter unit and the last is data out unit which helps in transferring image data on 
display device. 
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The size of code, complexity of data structure and function hierarchy should be limited as much as 
possible. There are different methodologies, which affects the efficiency of decoder. Some researchers 
followed bottom-up methodologies which includes block level design along with system-level designing. 
Whereas some researchers followed top-down methodologies which includes the whole function as a single 
unit. There is also difference between HLS generated hardware and standalone hardware. So, the different 
methodologies uses in different goal and application. A complex application may be more effective with 
approaches that breaks the code piecewise and optimize each piece of code where as a simple application 
may require more methods to get effective.so, it’s completely depends to purpose of application. 

To reduce the memory scope, several optimization are done by researchers. Some of the HEVC 
hardware interpolation is presented in [4-7], where they have compared different techniques. In their 
approach, they did not used memory based implementation. In [4], they have implemented three different 8- 
tap FIR filter by using a configurable path, which can evaluate single filter output at a time. Due to this 
reason, it can be used for only motion compensation. In [5], the presented hardware design uses multiplier 
less constant multiplication (MCM) approach for multiplying with constant factor in [6-7], they have used 
adders and shifters for generating FIR filters. 

In this paper we have demonstrated H.264 video coding with its real application and the types-of 
complexity which we face on top down approaches. Designing of decoder core is very important in fast and 
power efficient decoding. Many online platform like YouTube and Facebook using h.264 technique for video 
coding. Popular manufacturing company like apple and snapdragon also uses H.264 video coding for their 
processors. This paper includes that how we are improving the design process and what difficulties we are 
facing while designing a decoder. We synthesized the code, optimized the code and achieved a throughput 
which outperforms the state-of-art techniques. 

This paper is organized as following way. In section 2 we have demonstrated a brief related work of 
designing a decoders. In Section 3, an overview of H.264 decoder is presented. Section 4 describes our 
proposed optimization techniques of designing a H.64 decoders. The performance and result evaluation is 
shown in Section 5. In last section we conclude our paper. 


2. RELATED WORK 

HLS tools much hyped to achieve prototyping and rapid designing of hardware in register transfer 
level. J. Andrade et al [8] has proved this claim by using a complex application designing, which implements 
low-density parity-check (LDPC).HLS tool can help user to explore large space designing into multiple small 
designing, which make its productivity high. HLS tool can also explore micro architecture of the generated 
design. LDPC decoders are developed by using this HLS tools only, which has an average throughput. 

S. Baldev et.al [9] has developed an efficient 5-stage pipelining architecture of daglocking filter for 
design of HEVC decoder. The luma/chroma samples are applied vertically on edge filters of the designs to 
get maximum throughput and minimizes the number of clock cycle. This proposed architecture is developed 
in FPGA and ASIC platform using 90-nm technique. The result of this propose architecture shows that the 
UHD videos are decoded at 200fps. 

F. Leduc-Primeau et.al [10] has developed a design approach, known as quasi-synchronous design 
approach .LDPC decoder allows timing violation, which is modified through proper modeling by HLS. The 
new designed circuits can provide same performance with same area constraint but having an energy 
reduction of 32%. 

Designing of Node Processing Units in LPDC decoder [11] is very important for both hardware 
resources and processing experiences. NPU architectures supports decoder in keeping low hardware 
utilization with maximum operating frequency. The synthesis outcome proves the hardware efficiency for 
proposed architecture. 

T. Mallikarachchi et.al [12] has proposed a framework to reduce the complexity of decoding. By 
reducing the complexity of decoder, they have reduced the energy consumption during media playback. They 
also improved in bit-rate and video quality by designing this framework. 

H. Kim et.al [13] has developed an efficient architecture of HEVC for supporting ultra-high 
definition content by multicore implementation. In standard HEVC technique there is issue of data 
dependencies, which makes it inefficient for parallel processing. The novel architecture of memory 
organization solves the problem of data dependencies and makes its efficient for parallel processing. They 
have implemented de-blocking filter with skip mode pipelining to achieve high performance of throughput. 
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3. H.264 STANDARD VIDEO DECODING 

A H.264 standard has a number of profiles, which covers different encoding features, frame rates 
and resolutions. Thus, it’s mandatory to define the profile, while designing a decoder, so that, they will 
support for that specific profile. In this paper, we are designing a decoder for the main profile, normally used 
in standard video streaming and broadcasting. 

Input of the decoder is an encoded YUV video containing color spaced pixels, where Y represents 
luminance whereas U and V represents chrominance component. Human eyes are more sensitive to 
brightness than colors, for this reason chrominance data is considered for improving encoding efficiency. 

In H.264 standard, video is encoded frame wise, where I-frames are encoded without any 
information of past and future frames. Encoding of P-frame requires information of previous frame whereas 
encoding of B-frame requires information of past frame as well as future frames also. Encoded video is 
stored in form of bit stream. File format of encoded video provides information about each frame’s type. A 
H.264 input file format is shown in Figure 1. 
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Figure 1. Structure of H.264 encoded file 


SPS and PPS unit of input file (frame) contains information regarding decoding parameters and 
frame size. IDR unit of the input file is the first slice where frame is further divided into macro blocks. Each 
of the slice header having basic information of slice like slice identifier, number of macro blocks, 
quantization factor and frame configuration. 

A Decoder gets compressed bit-stream from the encoded input file where entropy decoder decodes 
the input bit stream into a set of quantized coefficient. The residual image information can be obtained by 
using inverse quantization and inverse transformation unit. The combined information of residual data, pre- 
decoded data and prediction information is utilized in final decoded image. 

In this paper, we mainly focus on getting higher resolution decoded image with a high frame rate. 
For this purpose we need to consider a design process, where we implement the parameterized buffer for 
temporary storage solution. A H.264 video decoder functional block diagram along with intra prediction unit 
is shown in Figure 2. 
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Figure 2. H.264 Decoder functional block diagram 
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4. IMPROVED H.264 DECODER WITH SIMULATION ARCHITECTURE 

In this section we will discuss about the unique designing and methodology to get improved 
implementation of H.264 decoder. We are mainly concentrating on difficulties in performing to designing a 
module, which can decode our video frame efficiently. In total designing process, a use case has considered 
for the hardware module which includes requirement verification, code writing and iterative source 
optimization with single function or multiple function along with system level optimization. The simulated 
architecture of H.264 decoder is shown in Figure 3. The architecture shows the connectivity of external 
memory with shared bus and interface unit. 
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Figure 3. Simulated H.264 decoder architecture 


4.1. Improved Synthesis Process 

In this section we will start improving the application iteratively by improving the code. We will 
first improving the design for area minimizing by implementing maximally parallelize and pipelining 
architecture. Though, this architecture cannot be used in complex application, because applying global 
parallelization makes area cost too high. Due to this reason, we should more selective in optimizing area and 
optimizing performance in complex application. 

We have written our code by the help of Xilinx ISE tool in Verilog hardware description language. 
It’s a tool which gives environment to develop the design code, synthesize the code and simulate the 
developed design code. 


4.1.1 Building Individual Function 

We start building every individual function in a fashion, which helps in getting performance on call. 
We have divided all single function into groups and performed several enhancement. We concentrated more 
on performance optimization rather than area optimization. 


4.2. Memory Reusability with Pipelining Architecture 

In order to make memory references efficient in reference software CPU implementation is a key 
factor. The hardware implementation of shared memory might be expensive, even in case of local BRAM 
block also, which may cause a performance bottleneck of the system. Thus, by applying splitted individual 
register or small local data array reusability can be an important factor in system performance. The decision 
of splitting registers or small local data array implementation depends on three factors. First, the functions 
which involved in array should be significant function individually. Second thing the functions which are 
involved in array should have data-parallelism and memory port limitation should not be there. The last thing 
is the implementation of BRAM. The reducing BRAM can be more effective than preventing the register use. 

This strategy is applied in our design implementation. But, if the local buffer size is not completely 
partitioning feasible, in that case we create an additionally local buffers which can be reused efficiently. In 
our code, in each iteration we read 5-6 overlapped data items where only one data item is new. Therefore, by 
creating local buffer we read a new data in each iteration. 

Till now, improvement is done for efficiency and memory reusability by low level function calling. 
Now, we can further improve our system for area constraints. A pipelining architecture helps a lot in 
improving performance as well as significant area optimization. We have pipelining based on the number of 
iterations to improve further performance of the system along with area optimization. 
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4.2.1 Improving Through Cross Function Examination 

We start developing function sequences any analyzed their importance in application. In descending 
order of significance, we can improve the leaf function sequence, which is more effective in enhancing 
important call sequences. Profiling data is updated and there after we find out the different sequences of 
function calls. These function call might be the improved leaf function or other several function including 
call stack. Each function latency is analyzed along with the number of calls on call stack. This data helps us 
in finding critical path in call stack. 


4.2.2luma and Chroma Parallel Data Flow Design 

Figure 4 demonstrates the Luma and Chroma parallel data path unit, where buffers are utilized. 
Weight predictor estimates the weight for both Luma and Chroma data samples where data is stored in small 
buffers. Average sum of the predicted weight, buffer data and individual predicted weight is applied to 
multiplexer for finalizing the output. 


Luma Dataflow 


Chroma Dataflow 
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Figure 4. Luma-Chroma Data flow design 


4.3. Looping Rearrangement and Function Inlining 

There are some factors, which can stop the pipelining completely in loop. The control flow of the 
loop may degrade the performance by applying conditional check in each loop iteration. Due to these reason, 
we reorder every loop where optimization of both the latency is needed considering the pipelining suitability. 

During single function optimization, it has been found in several cases, that there is a need of 
function inlining for both latency and area optimization. But it is not necessary that there is always a 
significant benefit of specialization would be there by function inlining. Due to duplication of resources, a 
tradeoff is there between function call overheads and increased area. By implementing inlining tool, we can 
achieve universally inline or we can prevent inlining of a function. In our case, we tried to get maximum 
benefit in each call site. However, in some cases it is beneficial only after inlining a subset of frequently used 
function call sites. 

After identifying important positions of potential saving, we implemented function inlining. Using 
profile data and call site positions, it is observed that the significance of all call sites for candidate function. 
We open multiple alternative option of inlining directives which helps in finding latency saving and area 
cost, But we have finalized inline choices by inlining manually for the implemented function. 

Buffer upgradation 

As we discussed earlier, in improving of single function buffer role is very significant to execute 
parallelism algorithm. Initially, we start from local buffer insertion to improve single leaf function. Here, we 
are not considering the places, where to define the buffer for getting maximum benefit of parallelization 
during the call stack. We just evaluate the individual buffer to define in call stack. Buffer can affect 
parallelism of sub-functions at higher level of call stack, but at the same time it leads complexity in inter- 
block edges, which experiences overhead while copying form global edges. In this situation, a confliction 
comes in sharing of potential data, because multiple sub-functions reuses same buffer. We analyze the call 
stack of each local buffer and try to define it earliest places, where overhead is minimum and sub function 
gets maximum benefit during parallelism. 
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4.4. System Level Parallel Processing 

Once call stack improvement is done, we improved every single function along with the important 
sequence of function in call stack. Now instead of optimizing less significant function, we start optimizing 
data buffering and data transportation across different portions of decoder, which is not directly connected 
with function call stack. 

After data optimization we find out top-level dependencies in profiling data, which helps in finding 
data dependency between buffers of top level function in call stack. 


4.4.1 Buffer Supplement and Task Level Parallelism 

As we discussed earlier, the input data stream is complex, so after call stack and local buffer 
insertion, the external data evaluation is critical factor in performance evaluation. That’s why we want to 
develop a core system, which should be independent from target resolution, where some part of memory can 
be accessed through external bus for data transfer rather than using local memories. Moreover, implemented 
local buffer are well optimized for parallelism in local function. So, there is no need of having facilitate 
reusing throughout call state region. 

By using the statistics of profile information, critical function is determined and clustering is 
developed for function call graph. Additionally, with the help of this clustering, a local data array is created 
for system-level, which is comparatively large and used by directly or indirectly. Directly use of this larger 
data can be possible when, there is no need of substantial parallelism whereas indirectly use of data can be 
possible by copying data into next level of buffer. Copying of data into next level buffer may experiences 
overhead, but it protects from latency produce by localization and reusability. Later we find out those sub 
function, which implements local buffer based on the observation of their buffering necessity and processing 
time. 

In our designing process, we have optimized first individual function or call sequences and ignored 
task-level parallelism. Later we considered, task-level parallelism by implementing two methods which are 
(a) Buffer duplication and (b) interface duplication. We use buffer duplication method in case of using input 
data by multiple function whereas interface duplication method in case of partitioning data into two or more 
group accessed individually by function. 


4.5. Runtime Memory Allocation 

In order to minimize the memory allocation problem H.264 uses a dynamic memory allocation. In 
our use-case also, we have used dynamic memory allocation for internal buffers which will convert in 
parameterized static allocations. The usage of dynamic memory allocation depends on the size of the input 
file resolution. It’s not good to design a decoder which will support only one maximum resolution. To 
overcome this problem, we have to redesign our model for each target case resolution which increases 
optimization challenges. A large amount of input buffer converts into top level memory interfaces, which 
will use for memory banks. It might need of transferring data using memory interface. That’s why we 
optimizing kernel’s without considering the video resolution. 


5. RESULT AND ANALYSIS 

We used Xilinx VCU1525 development kit as a synthesizable software for optimization of each 
intermediate stage of designing. We performed on-board verification using Omnitek Zynqg 7000. ARM CPU 
is utilized for data travelling while Zynq acts as a standalone FPGA. 

Vivado 2015.2 tool is applied to determine the occupied area and operating frequency for each 
design process. In order to measure performance of input video files and the corresponding board-level 
application, simulation has performed in H.264 video file. H.264 video file follows a repeating pattern like 
one frame is I-frame, next is P-frame and then B-frames and again P-frame. The latency of decoded frame 
depends on data for almost all frame type. We try to find out the worst case latency (maximum latency) of 
each frame type in each cycle. A weighted average is performed twice for B-frames compare to I or P frames. 
Later, the average latency of each cycle is multiplied with obtain frequency to check the average latency got 
per frame. Moreover, we have used Intel core 15-2310 CPU (2.9GHz) to complete the required modification. 

Because, our core design is independent of resolution, we tested our design with multiple resolution 
like QCIF 144p and 480p of input files. We estimated system performance for different resolutions. We also 
covered the average latency for per micro block of the system. We have compared our performance with 
different pre-developed H.264 implementation. 
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5.1. Performance Evaluation and Analysis 

We synthesized our design and evaluate performance while treating with QCIF input videos. At 
each optimization stage, occupied area is presented with performance for corresponding synthesized 
implementation. By implementing single function improvement, a little area gets optimized but it also 
improves the CPU performance, which indicates reusing function is well organized. Though, optimization of 
cross-function with the help of local buffer making the pipelining more effective rises performance 
excellently. The area depends primarily on FF, DSP and LUT units and SRAMs are utilized mainly in 
system-level buffer but rarely used in local buffers. Optimization in system level configuration makes the 
performance low in CPU level depicts a deviation in our desire result. In final step of parallelization, we 
achieve good throughput. 

However, our simulation process do not design a memory bandwidth at system level. H.264 system 
requires a lot of computation process, so, it should not be a memory bounded application. In experiment, 
Xilinx kintex-7 FPGAs is demonstrated to get a good area optimization with maximum frequency. 


Table 1. The number of luma and chroma intra prediction unit comparison is depicted 








Design LUT FF 

luma [14] 545 127 
luma[15] 212 37 
luma [proposed] 83 284 
chroma[14] 216 59 
chroma[15] 163 33 
chroma [proposed] 284 105 





Table 2. Comparison of H.264 decoders with other H.264 decoders and HEVC decoders are shown 








Our ESSCIRC = ISSCC ASSCC ISSCC VLSI 10 [ ISSCC 10[ Work done[ 

proposed 14 16] 13[17 ] 13[18] 12[19 ] 20] 21] 22] 

work 
Video- H.264 HEVC + HEVC HEVC H.264 H.264 H.264 HEVC 
format Multi- WD4 

format 

On-Chip 102.5 154KB 124KB 10.2KB 79.9KB 59.6KB 9.0KB 396KB 
SRAM KB 
Logic gates 190k 3454k 715k 446k 1338k 662k 414k 2887k 
technology 28nm/ 28nm/ 40nm/ 90nm/ 65nm/1.2 90nm/ 90nm/ 40nm/ 

0.9v 0.9v 0.9v 01.0v v 1.0v 1.09v 1.0v 
Clock rate 355 350 200 224 340 175 210 300 

MHz MHz MHz MHz MHz MHz MHz MHz 
DRAM DDR3L 32bLPDD 32bDDR3_ n/a 64bDDR2 64bDDR1 n/a 64bDDR3 
config R3 





6. CONCLUSION 

This paper shows an efficient designing of an h.264 decoder with its intra prediction unit (luma and 
chroma). For this process, we have used Xilinx VCU1525 development kit with on board verification using 
ARM CPU. By using runtime memory allocation, individual function improvement and cross function 
verification, we have achieved complete improvement in designing. This improvement provides good area 
optimization with maximum frequency of 355 MHz. We have used different block types for intra prediction 
unit to reduce the area cost of the decoder. By applying various improvement, we have achieved a great 
throughput. 
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